Abstract

Why do human languages change at some times, and not others? We address this 
longstanding question from a computational perspective, focusing on the case of sound 
change. Sound change arises from the pronunciation variability ubiquitous in every speech 
community, but most such variability does not lead to change. Hence, an adequate model 
must allow for stability as well as change. Existing theories of sound change tend to 
emphasize factors at the level of individual learners promoting one outcome or the other, 
such as channel bias (which favors change) or inductive bias (which favors stability). Here, 
we consider how the interaction of these biases can lead to both stability and change in a 
population setting. We find that population structure itself can act as a source of stability, 
but that both stability and change are possible only when both types of bias are active, 
suggesting that it is possible to understand why sound change occurs at some times and 
not others as the population-level result of the interplay between forces promoting each 
outcome in individual speakers. In addition, if it is assumed that learners learn from two 
or more teachers, the transition from stability to change is marked by a phase transition, 
consistent with the abrupt transitions seen in many empirical cases of sound change. The 
predictions of multiple-teacher models thus match empirical cases of sound change better 
than the predictions of single-teacher models, underscoring the importance of modeling 
language change in a population setting. 


1 Introduction 


Language changes over time: words come and go, pronunciations shift, and the structure 
of sentences mutates, such that the ‘same’ language becomes unintelligible to speakers of 
earlier generations. While language change is far from deterministic, it is often strikingly 
systematic. Indeed, it is the regularity of sound-meaning correspondences between words 
in different languages (e.g. Latin pedis, pater, pisces vs. English foot, father, fish) that
licenses hypotheses about a common ancestor. Documenting these sound changes helped
to establish linguistics as a scientific discipline in the 18th-19th centuries (Jones, 1788),
resulting in a rich knowledge of what types of sound changes have occurred in the world’s
languages (Kümmel, 2007; Paul, 1880).

For almost as long, linguists have asked why sound change occurs, and in particular why
particular changes take place, or actuate, at the time and place they do. This question,
known as the ‘actuation problem’, has proven much harder to answer (Baker, 2008;
Baker et al., 2011; Garrett and Johnson, 2013; Weinreich et al., 1968). One strand of
research has emphasized the role of universal phonetic pressures or channel biases that
introduce systematic, potentially asymmetric errors in transmission of a phonetic signal
between teacher and learner (Blevins, 2004; Moreton, 2008; Ohala, 1993). A commonly
cited example of a channel bias is coarticulation, which causes a speech sound to be
produced differently depending on the preceding and following sounds. Sound changes
such as Germanic i-umlaut, whereby low back vowels were fronted and raised when a
high front vowel or glide occurred in the following syllable (e.g. Proto-Germanic gasti
> West Germanic gesti ‘guests’, modern German Gäste), have been proposed to find
their source in this kind of conditioned variation (Blevins, 2004; Iverson and Salmons,
2003; Ohala, 1993). This leads to a view of actuation as a two-stage process: first, an
individual learner interprets a coarticulated variant as conventional (Ohala, 1981); then,
via a process of cultural transmission, the change subsequently spreads throughout the
speech community (Labov, 2010; Milroy, 1980).


While intuitively plausible, important aspects of this model remain to be fully specified. 
First, if channel biases such as coarticulation are universally active, why are all languages 
not constantly changing (Baker 2008; Baker et al. 2011; Weinreich et al., 1968)? It is 
clear that the presence of bias does not invariably result in change: for instance, even while 
umlaut was spreading throughout the West Germanic languages, it did not affect Gothic 
(Cercignani, 1980). An adequate model of sound change must therefore also account
for the possibility, even ubiquity, of stable variation at the level of the speech community.
One explanation for stability would be the existence of (possibly domain-general)
inductive biases guiding human inferences, which may facilitate or inhibit the learning
of certain types of structures or patterns (Briscoe, 2000; Griffiths and Kalish, 2007;
Kalish et al., 2007; Reali and Griffiths, 2009). Inductive biases have been proposed that
favour phonetically-motivated hypotheses about phonological patterns over phonetically
arbitrary ones (e.g. substantive biases: Moreton, 2008; Steriade, 2008; Wilson, 2006),¹ or
which promote the stability of existing phonetic category structures over the creation of
new ones (e.g. categoricity biases: Pierrehumbert, 2001; Wedel, 2006). However, if such
preferences are strong enough to counteract channel bias, then how can change ever occur?
Finally, when change does diffuse throughout a speech community, it often occurs suddenly
following a period of prolonged stability (Kroch, 1989; Labov, 2010). What types
of constraints on transmission and learning might interact to produce this type of rapid
shift from one stable state to another?

In this paper, we address these questions by modeling the acquisition and propagation 
of a phonetic parameter in a population setting. Our goal is a model that predicts both 
stability and change in the presence of biases promoting the other outcome, and in which 
small changes in the magnitude of bias produce a sudden and nonlinear change from
one stable state to another. Because such questions about language change are difficult 
to address empirically, we approach this problem from the perspective of computational 
and mathematical modeling, drawing on a large body of previous work in this tradition
(Blythe and Croft, 2012; Boyd and Richerson, 1985; Burkett and Griffiths, 2010;
Cavalli-Sforza and Feldman, 1981; Dediu, 2009; Griffiths and Kalish, 2007; Griffiths et al.,
2013; J. Kirby, 2013; S. Kirby et al., 2007; Komarova et al., 2001; Kroch, 1989; Niyogi,
2006; Niyogi and Berwick, 1997, 2009; Pierrehumbert, 2001; Reali and Griffiths, 2009;
Smith, 2009; Smith et al., 2003; Sonderegger and Niyogi, 2010; Wedel, 2006). Our approach
differs crucially from previous work in two respects. First, while models of language change
often frame the learner’s task as choosing between competing discrete variants (Baker et al.,
2011; Kroch, 1989; Niyogi, 2006; Sonderegger and Niyogi, 2010; Wang et al., 2004; Yang,
2000), a key part of learning the sound pattern of a language is learning distributions over
continuous phonetic parameters, such as vowel formants (Vallabha et al., 2007). Second, in
most existing models that have considered continuous parameters, change only and always
occurs in the presence of a channel bias (Baker, 2008; J. Kirby, 2013; Pierrehumbert, 2001;
Wedel, 2006). Here, we propose a model in which both stability and change of a continuous
parameter are possible in the presence of channel bias.

1 The terms ‘inductive bias’, ‘analytic bias’ and ‘learning bias’ are often used interchangeably in this literature;
see e.g. Moreton (2008); Moreton and Pater (2012).

By stability, we are referring to the structure of the stationary distribution of the 
continuous parameter in the population. Might stability at the population level have its 
roots in the inductive biases of individual learners? This seems plausible given work on 
the dynamics of cultural transmission showing that the distribution of a cultural trait 
evolves linearly to a unique stationary state that reflects the structure of learners’ prior 
(Griffiths and Kalish, 2007; S. Kirby et al., 2007; Reali and Griffiths, 2009). However,
while this result holds for chains of single teacher-learners, in general the dynamics become
nonlinear as the population structure becomes more complex (Burkett and Griffiths, 2010;
Dediu, 2009; Niyogi and Berwick, 2009; Smith, 2009), potentially giving rise to
different outcomes (such as stability and change) from similar initial conditions (Niyogi
and Berwick, 2009). In what follows, we thus consider population structures of increasing
complexity, and assess our models on the basis of their ability to explain how a nonlinear
transition from stable variation to sound change could occur.


2 Model 

We consider a scenario in which each agent may (a) function as a learner, receiving input
from other agents and applying a learning algorithm to this input in order to learn a
probability distribution over how a continuous parameter is realised, and (b) function as
a teacher, generating data from this distribution for other learners. Within this framework,
there are many assumptions one could make about each of these actions. Here we consider 
variants on a simple supervised learning scenario, where all that needs to be learned 
is a distribution over a single phonetic dimension, parametrized by a single continuous 
parameter. For concreteness, our exposition follows the example of umlaut described in 
the Introduction, but the basic results are applicable to the learning of a single continuous 
parameter more generally. 

2.1 Linguistic setting 

We assume that speech sounds have been organised into discrete segments, and that
the learner has access to the complete segmental inventory. We consider here a simple
language with the lexicon $\mathcal{L} = \{V_1, V_2, V_{12}\}$, where $V_{12}$ represents $V_1$ in the context of
$V_2$. $V_1$ and $V_2$ can be thought of as the vowels /a/ and /i/ in isolation, and $V_{12}$ as /a/
in a context where it is coarticulated (raised) towards /i/.

Tokens are represented by their first formant (F1) value, an acoustic measure of vowel
height (Hillenbrand et al., 1995).² We assume that the F1 distributions for $V_1$ and $V_2$
are normal ($N(\mu_a, \sigma_a^2)$, $N(\mu_i, \sigma_i^2)$), are known to all learners, are the same for all learners,
and do not change over time.³ The distribution of $V_{12}$ is normal, with (fixed) variance as
for $V_1$ and a mean we denote by c:

$$V_{12} \sim N(c, \sigma_a^2) \qquad (1)$$

We will sometimes refer to $V_{12}$ (or equivalently, c, which determines the distribution of
$V_{12}$) as the contextual variant.

In addition, we assume that productions of $V_{12}$ are subject to a coarticulatory channel
bias corresponding to the general tendency in speech production to over- or undershoot
articulatory targets based on speech context (Lindblom, 1983; Pierrehumbert, 2001). We
take this bias to be normally distributed with mean $-\lambda$ (because $V_2$ has lower F1 than $V_1$)
and variance $\omega^2$, and to be applied i.i.d. to each vowel token. Thus, the actual productions
of $V_{12}$ by a teacher with contextual variant c follow the distribution

$$F1 \sim N(c - \lambda, \sigma_a^2 + \omega^2) \qquad (2)$$
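As a minimal sketch of how Eqs. 1 and 2 fit together, the following Python snippet draws F1 tokens for a single teacher by adding an independent channel-bias term to each intended realization. The function name `sample_productions` and all parameter values are hypothetical, chosen only for illustration.

```python
import random
import statistics

def sample_productions(c, n, lam, sigma_a, omega, rng):
    """Draw n F1 tokens of V12 for a teacher with contextual variant c.

    Each token is an intended realization from N(c, sigma_a^2) (Eq. 1) plus an
    independent channel-bias term from N(-lam, omega^2), so that tokens follow
    N(c - lam, sigma_a^2 + omega^2) (Eq. 2).
    """
    tokens = []
    for _ in range(n):
        intended = rng.gauss(c, sigma_a)   # realization under Eq. 1
        bias = rng.gauss(-lam, omega)      # coarticulatory channel bias
        tokens.append(intended + bias)
    return tokens

# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(0)
tokens = sample_productions(c=720.0, n=100_000, lam=2.0,
                            sigma_a=50.0, omega=5.0, rng=rng)
sample_mean = statistics.fmean(tokens)      # should be close to c - lam
sample_var = statistics.pvariance(tokens)   # close to sigma_a^2 + omega^2
```

With these values the sample mean should lie near 718 and the sample variance near 2525, as Eq. 2 predicts.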


2.2 Learning and evolution 


We assume agents are divided into discrete generations of size M. Each learner in generation
t + 1 receives n examples of $V_{12}$ (distributed according to Eq. 2) from one or more
teachers from generation t. The learner’s task is to infer c by application of some learning
algorithm.

We assume that learners apply a learning algorithm which is ‘rational’, in the sense that
they assume that their learning data was generated i.i.d. according to Eq. 1 and estimate
the most probable value of c. Results are presented below for three learning algorithms
(Naive learning models, Simple prior models, Complex prior models) corresponding to
different assumptions about learners’ inductive biases. Here, we specifically model the
effect of a categoricity bias, operationalised as a prior over values of c.

For each case, we consider three population structures (Fig. 2), corresponding to the
number of teachers m in generation t from whom each learner in generation t + 1 receives
her learning data: m = 1, m = 2, and m = all (equivalently, m = M). These three values are
chosen as representative for understanding the dynamics when any number m of teachers
is assumed. This question is of interest in light of previous computational studies highlighting
the differences between single- and multiple-teacher scenarios in language evolution
(Burkett and Griffiths, 2010; Dediu, 2009; Niyogi and Berwick, 2009; Smith, 2009). The
single-teacher case corresponds most closely to the population structure considered in
‘iterated learning’ models of language evolution (e.g. Griffiths and Kalish, 2007; S. Kirby
et al., 2007; Reali and Griffiths, 2009; Smith et al., 2003).⁴ The all-teachers case
corresponds to the population structure usually assumed in dynamical systems models of
language change (e.g. Niyogi, 2006; Niyogi and Berwick, 1997; Sonderegger and Niyogi,
2010). The two-teacher case is representative of all m between 2 and $M - 1$, because
the dynamics turn out to be extremely similar for any m > 1, as we show below. We
assume throughout that the m teachers are chosen uniformly from teachers in the previous
generation, with replacement.

2 Note that F1 is inversely related to physical tongue height, so F1 is lower for /i/ than for /a/.

3 This could be taken to mean that learners in generation t receive a very large number of $V_1$ and $V_2$ examples,
and learn these distributions perfectly from generation t - 1.

4 Although our single-teacher scenario is closest to that considered in iterated learning models, there remain
important differences, as discussed in Appendix A.

Figure 1: Schematic of possible distributions of the contextual variant (c) in the population
over time, and possible dependence of the mean value of c on system parameters. (A) Stable
contextual variation: the distribution of c in the population is stable over time, and its mean
is closer to $\mu_a$ than to $\mu_i$. (B) Stable umlaut: the distribution of c is stable over time, and its
mean is near $\mu_i$. (C) Nonlinear transition from stable variation to stable umlaut. The mean of
the stable population-level distribution of c depends on two parameters: the strengths of the
coarticulatory channel bias and the categoricity bias. For most parameter values, there is stable
contextual variation or stable umlaut; a nonlinear transition from one to the other occurs when
a boundary in parameter space is crossed.

Considering the ensemble of M teachers in generation t, the state of the population at
t can be characterized by the random variable $C^t$, whose distribution describes how likely
different values of c are. Similarly, the values of c learned by the M learners in generation
t + 1 can be characterized by $C^{t+1}$. For simplicity, we assume that M is infinite. The
evolution of the distribution of c is then deterministic, making its behavior more easily
analyzed as a dynamical system. This and several other aspects of our modeling framework
(e.g. discrete generations) are shared with previous dynamical systems models of language
change considering discrete variants (Niyogi, 2006; Niyogi and Berwick, 1997).
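The generational update described above can be approximated with a finite population. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: it uses a large but finite M in place of the infinite-M idealization, and the function name `next_generation`, the `learn` argument, and all parameter values are hypothetical. The learner passed in here is the sample mean, which is the maximum-likelihood estimate of a normal mean.

```python
import random
import statistics

def next_generation(teachers_c, m, n, lam, sigma_a, omega, learn, rng):
    """Return the values of c learned by a new generation of the same size.

    teachers_c: list of c values held by the current generation.
    m: number of teachers per learner (drawn uniformly, with replacement).
    learn: function mapping a list of n F1 tokens to an estimate of c.
    """
    sd = (sigma_a ** 2 + omega ** 2) ** 0.5   # token s.d. under Eq. 2
    learners_c = []
    for _ in range(len(teachers_c)):
        my_teachers = [rng.choice(teachers_c) for _ in range(m)]
        # Each data point comes from one of the learner's m teachers, chosen
        # with equal probability, and is distributed according to Eq. 2.
        data = [rng.gauss(rng.choice(my_teachers) - lam, sd) for _ in range(n)]
        learners_c.append(learn(data))
    return learners_c

# Hypothetical finite population standing in for the infinite-M idealization.
rng = random.Random(1)
population = [rng.gauss(720.0, 3.0) for _ in range(2000)]
mean_before = statistics.fmean(population)
population = next_generation(population, m=2, n=50, lam=1.5, sigma_a=10.0,
                             omega=2.0, learn=statistics.fmean, rng=rng)
mean_after = statistics.fmean(population)
```

Passing the learning algorithm as a function keeps the population structure (the choice of m) separate from the learner's inductive bias, mirroring the way the two are varied independently in the text.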

Given a choice of learning algorithm, population structure, and channel bias, we seek to
characterize the evolution of the distribution of c, and determine to what extent it satisfies
our modeling goals: stability in the presence of channel bias, change in the presence of
categoricity bias, and a nonlinear shift from one stable state (where the distribution of c
does not change over time) to another. We are especially interested in two types of stable
state: stable contextual variation, where the mean value of c in the population is nearer
to $\mu_a$ than to $\mu_i$, and stable umlaut, where this mean is near $\mu_i$. Fig. 1 exemplifies what
the distribution of c in the population over time could look like in both cases, as well
as one possible way in which a nonlinear shift from stable contextual variation to stable
umlaut could occur as system parameters are varied.⁵

In the remainder of the paper, we first consider the simplest case, where learners have
no prior on values of c (Naive learning models); we then consider the effects of introducing
different types of categoricity bias into the learning algorithm (Simple prior models and
Complex prior models), and conclude by discussing our results.

5 Note that in the right panel of Fig. 1, what is important for our purposes is the nonlinearity of the transition
between stable contextual variation and stable umlaut, rather than the shape of the boundary in parameter
space across which the transition occurs (which could be any curve, rather than the line shown in Fig. 1).

Figure 2: Three types of population structure are considered in our models: (a) Single-teacher
scenario. Each learner in generation t + 1 receives all her learning data from a single
randomly-chosen teacher in generation t. (b) Multiple-teacher scenario (two teachers). Each
data point comes from one of two teachers with equal probability. (c) Multiple-teacher scenario
(M teachers). Each data point comes from a random teacher with equal probability. In (b)-(c),
teachers are chosen uniformly at random from teachers in generation t (with replacement).
In all cases, lines of descent may be pruned, i.e. some teachers may not provide data to any
learners in the following generation.

For each class of model (naive, simple prior, complex prior), we are interested in the
evolution of the distribution of c, for which there is no general analytic solution. For the
naive learning models and simple prior models, we consider how the mean and variance
of this distribution change over time, which can be derived analytically using techniques
familiar from the cultural evolution literature (Boyd and Richerson, 1985; Griffiths et al.,
2013) and dynamical systems models of language change in discrete variants; derivations
for all analytic results are given in Appendices B and C. For the complex prior models, we
proceed by simulation.


3 Naive learning models 


We first consider maximum-likelihood (ML) learners, who are ‘naive’ in the sense of having
no prior over c, and simply choose the value of c under which the likelihood of the data
(according to Eq. 1) is highest.

In the case where each learner in generation t + 1 receives all n examples from a single
teacher, the evolution of the mean and variance of c is given by:

$$E[C^{t+1}] = E[C^t] - \lambda \qquad (3)$$

$$\mathrm{Var}(C^{t+1}) = \frac{\sigma_a^2 + \omega^2}{n} + \mathrm{Var}(C^t) \qquad (4)$$

Thus, when there is coarticulation ($\lambda > 0$), the mean of the contextual variant decreases
in every generation by an amount equal to the mean amount of channel bias; if there is no
coarticulation ($\lambda = 0$), the mean stays the same over time. Regardless of the value of $\lambda$,
however, the inter-speaker variability in the realization of the contextual variant increases
without bound over time.
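The recursions in Eqs. 3 and 4 follow from standard properties of the sample mean. The short derivation below is a sketch under the assumptions already stated (tokens distributed as in Eq. 2, and the ML estimate of a normal mean being the sample mean of the learner's n tokens). The symbols $c_T$ (the value of c held by the learner's teacher), $y_j$ (an individual token) and $\bar{y}$ (the sample mean) are introduced here for illustration only.

```latex
\hat{c} = \bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j, \qquad
y_j \mid c_T \sim N(c_T - \lambda,\; \sigma_a^2 + \omega^2)
\\[4pt]
E[C^{t+1}] = E\big[\,E[\bar{y} \mid c_T]\,\big] = E[C^t] - \lambda
\\[4pt]
\mathrm{Var}(C^{t+1})
  = E\big[\mathrm{Var}(\bar{y} \mid c_T)\big] + \mathrm{Var}\big(E[\bar{y} \mid c_T]\big)
  = \frac{\sigma_a^2 + \omega^2}{n} + \mathrm{Var}(C^t)
```

The last line uses the law of total variance: the first term is sampling noise in the learner's estimate, and the second is variability inherited from the teacher, which is passed on undiminished when there is only one teacher.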

Next, consider the case where each learner receives all examples from two teachers.
The evolution of the mean of the contextual variant (c) in this case is again described by
Eq. 3, while the variance now rapidly converges to a fixed point, regardless of the initial
distribution:

$$\mathrm{Var}(C^t) \to \frac{2(\sigma_a^2 + \omega^2)}{n - 1} \qquad (5)$$

Thus, the mean value of c decreases without bound over time ($\lambda > 0$) or stays constant
($\lambda = 0$), while the variance quickly stabilizes, in contrast to the single-teacher case.

Figure 3: Evolution of population variance $\mathrm{Var}(C^t)$ for different numbers of training examples,
assuming $\sigma_a^2 = 60$, $\omega = 0$ or $\omega = 5$, and m = 1, 2, or M. In the single-teacher (m = 1) setting,
the variance increases without bound over time, while for two or more teachers, it rapidly
stabilizes.

In fact, it can be shown that the dynamics are similar for any case where m > 1:
the mean of c is described by Eq. 3, while its variance moves towards a fixed point. The
larger m is, the smaller this stable variance is (Appendix B.2). In the limiting case where
m = M (Fig. 2c), the variance converges to:

$$\mathrm{Var}(C^t) \to \frac{\sigma_a^2 + \omega^2}{n - 1} \qquad (6)$$

The evolution of the variance in the single-teacher, two-teacher, and all-teacher cases
is illustrated in Fig. 3.
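The convergence of the variance can be checked numerically. The sketch below iterates a one-step variance recursion for m teachers that is derived here from the sampling assumptions above; it is not quoted from the source or its Appendix B. For m = 1 it reduces to unbounded growth by a constant increment per generation, and its fixed points for m = 2 and for very large m agree with Eqs. 5 and 6. Function names and parameter values are hypothetical.

```python
def variance_step(v, m, n, sigma_a2, omega2):
    """One generation of the population variance of c for ML learners.

    Assumes each of a learner's n tokens comes from one of m teachers chosen
    with equal probability, so two tokens share a teacher with probability 1/m.
    """
    noise = (sigma_a2 + omega2) / n          # sampling noise in the sample mean
    inherited = v * (m + n - 1) / (n * m)    # variability passed on by teachers
    return noise + inherited

def iterate_variance(v0, m, n, sigma_a2, omega2, generations):
    """Apply variance_step repeatedly, starting from variance v0."""
    v = v0
    for _ in range(generations):
        v = variance_step(v, m, n, sigma_a2, omega2)
    return v

# Hypothetical values: sigma_a^2 = 60, omega^2 = 25, n = 100 tokens per learner.
s2, n = 60.0 + 25.0, 100
v_one = iterate_variance(10.0, m=1, n=n, sigma_a2=60.0, omega2=25.0, generations=200)
v_two = iterate_variance(10.0, m=2, n=n, sigma_a2=60.0, omega2=25.0, generations=200)
v_all = iterate_variance(10.0, m=10**9, n=n, sigma_a2=60.0, omega2=25.0, generations=200)
eq5 = 2 * s2 / (n - 1)   # Eq. 5: two-teacher fixed point
eq6 = s2 / (n - 1)       # Eq. 6: all-teacher fixed point
```

In this sketch the inherited term is multiplied by a factor below 1 whenever m > 1, which is why the variance settles to a fixed point in the multiple-teacher cases but not in the single-teacher case.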

Summary In the naive learning models, if speakers do not coarticulate, the mean
realization of $V_{12}$ in the population remains constant over time, regardless of the number
of teachers. This is empirically inadequate, as it predicts change from stable contextual
variation to stable umlaut to be impossible. In the presence of any channel bias ($\lambda > 0$),
the mean of c in the population steadily decreases over time, again regardless of the number
of teachers. In this case, change from stable contextual variation to stable umlaut is not
possible, because stability is not possible, in the sense of a distribution of c which does
not change over time. This problem is even worse for the single-teacher model, where
the variance of $C^t$ in the population is predicted to steadily increase. As far as we are
aware, a permanently unstable and unstructured distribution of population-level variation
in phonetic realization is uncharacteristic of speech communities.


4 Simple prior models 

The main problem with the naive learning models is that change becomes inevitable once
channel bias is introduced. This problem has led to criticism of theories of sound change
that rely on the accumulation of incremental change (Baker, 2008; Baker et al., 2011;
Weinreich et al., 1968). However, the inevitability of change in these models is not simply
a function of the channel bias itself, but also of the absence of any force acting to counteract
the bias. Perhaps the simplest type of countervailing force would be to assume that learners
have a prior categoricity bias over c against values away from $\mu_a$. In particular, consider a
simple Gaussian prior centered at $\mu_a$ with variance $\tau^2$ (Fig. 4A), a type that has previously
been considered in work on the evolution of a continuous parameter (Griffiths et al., 2013;
we compare this study’s results to ours in Appendix C.1).


Learners receive n examples in the same way as in the naive learning models, but
now their knowledge about contextual variation is probabilistic: a given learner begins
with a prior distribution on how likely different amounts of contextual variation are a
priori, which is updated to a posterior distribution based on her data, assuming that
the distribution of the data given c is given by Eq. 1. She then takes the maximum a
posteriori (MAP) estimate as her point estimate $\hat{c}$ of the contextual variant.⁶ As for the
naive learning models, we consider the evolution of the mean and variance of $C^t$.

Regardless of the number of teachers, the mean of $C^t$ rapidly moves towards a fixed
point, namely:

$$E[C^t] \to \mu_a - \lambda n \frac{\tau^2}{\sigma_a^2} \qquad \text{(1, 2, \ldots, M teachers)} \qquad (7)$$

Thus, the stronger the prior bias against contextual variation (smaller $\tau$), the
smaller the eventual mean degree of contextual variation in the population, but increasing
the strength of the channel bias (larger $\lambda$) has the opposite effect (Fig. 4B).
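The fixed point in Eq. 7 can be checked by iterating the expected MAP estimate. The sketch below assumes the standard conjugate result for a normal prior and a normal likelihood with known variance, under which the MAP estimate is a weighted average of the prior mean and the sample mean. Because this estimate is linear in the data, the population mean can be iterated directly. The function names and parameter values are hypothetical.

```python
def map_estimate_simple(data_mean, n, mu_a, tau2, sigma_a2):
    """MAP estimate of c under a N(mu_a, tau2) prior and the Eq. 1 likelihood."""
    w = n * tau2 / (sigma_a2 + n * tau2)   # weight given to the data
    return (1 - w) * mu_a + w * data_mean

def iterate_mean(mean0, lam, n, mu_a, tau2, sigma_a2, generations):
    """Iterate the population mean of c across generations."""
    mean = mean0
    for _ in range(generations):
        # The expected sample mean of a learner's tokens is E[C^t] - lam (Eq. 2).
        mean = map_estimate_simple(mean - lam, n, mu_a, tau2, sigma_a2)
    return mean

# Hypothetical values, chosen only for illustration.
mu_a, sigma_a2, tau2, n, lam = 730.0, 2500.0, 25.0, 100, 2.0
final_mean = iterate_mean(mu_a - 10.0, lam, n, mu_a, tau2, sigma_a2, generations=500)
eq7 = mu_a - lam * n * tau2 / sigma_a2   # fixed point given by Eq. 7
```

With these values the weight on the data is 0.5, so the mean halves its distance to the fixed point in each generation and settles at 728, two units below $\mu_a$.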

As in the naive learning models where m ≥ 2, the variance of $C^t$ always rapidly moves
towards a fixed point for all types of population structure. The formulas for these fixed
points depend on $\sigma_a$, $\omega$, $\tau$, and n. To get a sense of their essential properties, we write
them in a form which assumes $n \gg 0$:

$$\mathrm{Var}[C^t] \to \frac{\kappa}{2}\tau^2 - \frac{\kappa \sigma_a^2}{4}\,\frac{1}{n} + O\!\left(\frac{1}{n^2}\right) \qquad \text{(one)} \qquad (8)$$

$$\mathrm{Var}[C^t] \to 2\kappa\sigma_a^2\,\frac{1}{n} + O\!\left(\frac{1}{n^2}\right) \qquad \text{(two)} \qquad (9)$$

$$\mathrm{Var}[C^t] \to \kappa\sigma_a^2\,\frac{1}{n} + O\!\left(\frac{1}{n^2}\right) \qquad \text{(M)} \qquad (10)$$

where $\kappa = 1 + \omega^2/\sigma_a^2$ and $O(1/n^2)$ denotes a constant divided by $n^2$. While the variance
always stabilizes over time, even for the single-teacher case, comparing Eqs. 8-10 shows
that (for large enough n) just as for the naive learning models, the larger the number of
teachers, the smaller the eventual amount of population-level variability in c.


Summary The qualitative evolution of c in simple prior models is the same regardless
of the magnitude of channel bias (including when $\lambda = 0$): both the mean and variance
of the realization of $V_{12}$ in the population always move to a stable value. In the limit
of large n, the stable variance shows an important qualitative difference that depends
on the number of teachers: while convergence to a form reflecting the prior is seen in the
single-teacher scenario, the stable value in scenarios with two or more teachers does not
directly reflect the prior ($\tau$ is not a term in Eqs. 9-10; cf. Griffiths et al., 2013).


6 The other common strategy for obtaining a point estimate of a posterior distribution, taking the expected
value, turns out to be equivalent (Appendix C).

Figure 4: Simple prior models setup and results, with $\mu_i = 530$, $\mu_a = 730$, $n = 100$, $\sigma_a = 50$.
(A) Prior distribution over c ($N(\mu_a, \tau^2)$) for values in $[\mu_i, \mu_a]$. The parameter $\tau$ controls the
prior strength, with values closer to 0 corresponding to a greater preference for values of c near
$\mu_a$. (B) Final population mean of c as a function of channel bias ($\lambda$) and prior strength ($\tau$),
assuming the minimum value is $c = \mu_i$ (for comparability with Fig. 7B). The final mean does
not depend on the number of teachers or the starting state of the population, and changes
gradually as $\lambda$ and $\tau$ are changed.


The simple prior models allow for stable contextual variation at a value that depends 
on the relative strengths of the channel and categoricity biases. However, these models 
are in some sense too stable: because stability depends on particular values of the system 
parameters, in order for a change to ‘go to completion’ (i.e., to stable umlaut) the system 
parameters would need to be continually changing in each generation—implying that each 
generation coarticulates more than the previous, has a weaker categoricity bias, or both. 
While this is ultimately an empirical question, it seems to us useful to start from the 
assumption that the effects of purportedly universal biases do not change steadily over 
time. In this sense, the simple prior models are inadequate in that there is no threshold 
in the system parameters triggering rapid movement to stable umlaut. 


5 Complex prior models 


The simple prior is indeed a type of categoricity bias, but one that is asymmetrically biased
entirely toward one of the two pre-existing categories. Here, we consider the ramifications
of relaxing this assumption, assuming instead that learners have a complex prior which
weights values of c near either $\mu_a$ or $\mu_i$ higher than values in between:

$$P(c) \propto \left[ a(\mu_a - \mu_i)^2 + \left(c - (\mu_a + \mu_i)/2\right)^2 \right] \qquad (11)$$

The strength of this prior is controlled by a: as $a \to 0$, values of c near $\mu_a$ and $\mu_i$ are
maximally preferred (Fig. 7A).

We assume the learner takes the MAP estimate $\hat{c}$ over the range $[\mu_i, \mu_a]$. Unlike in
previous models, the mean and variance of $C^t$ cannot be determined analytically, and we
thus proceed by simulation to determine the evolution of the distribution of $C^t$ over
time in this case. Technical details of these simulations are given in Appendix D.1; here we
describe the basic setup of the simulations, and their results.
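To make the learner's computation concrete, the sketch below evaluates the prior in Eq. 11 and finds a MAP estimate by grid search over $[\mu_i, \mu_a]$. The grid search, the function names, and the parameter values are assumptions made for illustration; the simulation procedure actually used is described in Appendix D.1 and is not reproduced here.

```python
import math

def log_prior(c, a, mu_a, mu_i):
    """Unnormalized log of the complex prior in Eq. 11 (requires a > 0)."""
    mid = (mu_a + mu_i) / 2
    return math.log(a * (mu_a - mu_i) ** 2 + (c - mid) ** 2)

def map_estimate_complex(data_mean, n, a, mu_a, mu_i, sigma_a2, grid_size=2001):
    """Grid-search MAP estimate of c on [mu_i, mu_a]."""
    best_c, best_score = None, -math.inf
    for k in range(grid_size):
        c = mu_i + (mu_a - mu_i) * k / (grid_size - 1)
        # Log-likelihood of n tokens under Eq. 1, up to a constant in c.
        log_lik = -n * (c - data_mean) ** 2 / (2 * sigma_a2)
        score = log_prior(c, a, mu_a, mu_i) + log_lik
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Hypothetical values: a learner whose n = 10 tokens average 725.
mu_a, mu_i, sigma_a2 = 730.0, 530.0, 2500.0
est_strong = map_estimate_complex(725.0, 10, a=0.001, mu_a=mu_a, mu_i=mu_i,
                                  sigma_a2=sigma_a2)
est_flat = map_estimate_complex(725.0, 10, a=100.0, mu_a=mu_a, mu_i=mu_i,
                                sigma_a2=sigma_a2)
```

With a small value of a, the estimate is pulled from the sample mean towards the nearer endpoint $\mu_a$; with a large value of a, the prior is nearly flat and the estimate stays close to the sample mean, as in the naive learning models.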
Figure 5: Evolution of the PDF of $C^t$ (thresholded at $f_{C^t}(c) = 0.0001$) from a starting distribution
of $C^1 \sim N(\mu_a - 10, 10)$, divided into columns by the number of teachers (m), for values of a
and $\lambda$ which result in stable contextual variation (top row), change to stable umlaut (middle
row), and similar behavior to the naive learning models (bottom row). $\mu_a = 730$, $\mu_i = 530$, and
other parameters are listed in Appendix D.1.

The simulations described below consider the evolution of a population that starts
with a mean realization of $V_{12}$ similar to $V_1$ ($C^1 \sim N(\mu_a - 10, \sigma_a^2)$), in order to determine
whether both stable contextual variability and change to stable umlaut are possible in this
model. Of interest is how the strength of the prior (a) and the coarticulatory channel bias
($\lambda$) affect the evolution of the distribution from this starting point, which we examine
for the same three population structures as in previous models. We first examine the
evolution of the distribution of $C^t$ over time (which we refer to as the trajectory of $C^t$) for
particular values of a, $\lambda$, and m (Trajectories of $C^t$), then examine how the final mean of
the distribution of $C^t$ depends on these three parameters (Final mean of $C^t$).

5.1 Trajectories of $C^t$: examples

We show some qualitatively different ways in which the distribution of $C^t$ can evolve, by
examining the trajectories of $C^t$ beginning from $C^1 \sim N(\mu_a - 10, 10)$, for particular values
of a, $\lambda$, and m, stopping each simulation when t = 1000. (It is visually clear from the
results of these simulations, shown in Figs. 5 and 6, that the distribution of $C^t$ is no longer
changing by this point, i.e. has reached a stable state.)

Figure 6: Evolution of the PDF of $C^t$ (thresholded at $f_{C^t}(c) = 0.0001$) from a starting distribution
of $C^1 \sim N(\mu_a - 10, 10)$, divided into columns by the number of teachers (m), for values of a
and $\lambda$ which give qualitatively different behavior for m = 1 and m > 1. $\mu_a = 730$, $\mu_i = 530$,
and other parameters are listed in Appendix D.1.

To get a sense of the joint effect of the complex prior and channel bias on
the dynamics of c, we first consider trajectories for three limiting cases, shown in Fig. 5:

• Case 1: strong prior, weak channel bias (top row: a = 0.001, $\lambda = 0.25$): For a sufficiently
strong prior relative to the strength of the channel bias, contextual variation
is stable over time (for 1, 2, all teachers). The stable variance of the distribution is
much larger for m = 1 than for m > 1, and is slightly larger for m = 2 than for
m = M.

• Case 2: strong prior, strong channel bias (middle row: a = 0.001, $\lambda = 4$): For a
sufficiently strong channel bias relative to the prior strength, change to stable umlaut
rapidly occurs (for 1, 2, all teachers). The transition is slightly faster for m > 1 than
for m = 1.

• Case 3: weak prior, weak channel bias (bottom row: a = 0.5, $\lambda = 0$): In the
single-teacher case, the variance rapidly spreads, and all values of c become roughly
equiprobable. For more than one teacher, the mean changes little and the variance
rapidly stabilizes, with the stable variance slightly larger for m = 2
than for m = M. These behaviors are similar to the analogous naive learning models,
as expected given that a sufficiently weak prior is effectively flat.

In Cases 1-3, the evolution of $C^t$ looks qualitatively similar for m = 1, m = 2, and
m = M, with a significantly larger variance of $C^t$ at each time point for the m = 1 case.
However, there is also a range of (a, $\lambda$) parameter space where the evolution of $C^t$ looks
qualitatively different depending on the number of teachers. Fig. 6 shows two ways in
which this can happen:

• Case 4: strong prior, medium channel bias (top row: $a = 0.001$, $\lambda = 1.3$): Regardless
of the number of teachers, the stable state of the population shows stable contextual
variation, in the strict sense defined above that the mean of $c$ in the population
is closer to $\mu_a$ than to $\mu_i$, but this is realized in qualitatively different ways for
$m = 1$ and $m > 1$. In the single-teacher case, the distribution of $C^t$ reflects the
(strong) prior, in the sense that some individuals have values of $c$ near $\mu_a$ (contextual
variation) and some have values of $c$ near $\mu_i$ (umlaut), with a gap in between.
That is, change to umlaut has ‘gone through’ for some individuals, but not others.
In contrast, in the multiple-teacher cases, the distribution of $C^t$ becomes tightly
clustered around the population mean (which is nearer to $\mu_a$ than to $\mu_i$).

Figure 7: Complex prior models setup and results, with $\mu_a = 730$ and $\mu_i = 530$, and other
parameters listed in Appendix D.1. (A) Prior distribution over $c$ (Eq. 11) for values in [...].
The parameter $a$ controls the strength of the prior, with values nearer to 0 corresponding to
a greater preference for values of $c$ near either endpoint. (B) Final population mean of $c$,
beginning from the same starting state, as a function of channel bias ($\lambda$) and prior
strength ($a$). The final mean of $c$ depends on the number of teachers (1 vs. 2+), and changes
nonlinearly as $\lambda$ and $a$ are changed. In particular, for 2+ teachers there is a
bifurcation: once $\lambda$ is large enough relative to $a$, rapid change to stable umlaut occurs.

• Case 5: medium prior, medium channel bias (bottom row: $a = 0.01$, $\lambda = 1.3$): In this
case, the channel bias is kept at the same value, but the prior is weakened sufficiently
that change to stable umlaut eventually occurs, regardless of the number of teachers.
However, the trajectory of $C^t$ looks qualitatively different depending on the number
of teachers. For $m = 1$, the population contains two types of individuals (those
with values of $c$ near $\mu_a$, and those with values of $c$ near $\mu_i$), and the proportion
of the second type becomes greater over time, until the whole population has $c$ near
$\mu_i$. For $m > 1$, individuals have values of $c$ tightly clustered around the population
mean, which steadily changes from near $\mu_a$ to near $\mu_i$ over time.

In Cases 4-5, it is again the case (as in Cases 1-3) that the two-teacher and $M$-teacher
cases look very similar, with a slightly larger variance of $C^t$ when $m = 2$.

Of the trajectories considered above, Cases 1-2 are particularly important: they show
that both stable contextual variation and stable umlaut are possible, as $a$ and $\lambda$ are
varied. In particular, it is possible to get change to stable umlaut in the presence of a
strong categoricity bias (which was not possible in the simple prior model), as well as
stable contextual variation near $\mu_a$ in the presence of channel bias. These outcomes are
two of our modeling goals. We now consider how the final state of the population depends
on prior strength and channel bias, as $a$ and $\lambda$ are varied between these limiting cases,
to get a sense of whether the complex prior model meets our final modeling goal: a threshold
in the system parameters ($a$ and $\lambda$) which triggers rapid movement to stable umlaut.

5.2 Final mean of $C^t$ as a function of system parameters

Fig. 7 shows the final mean of $c$ in the population as $\lambda$ and $a$ are varied. In the
single-teacher case (panel 1), stable contextual variation is possible only for the strongest
priors or when $\lambda = 0$. As the strength of the prior is relaxed, the population mean comes
to rest either in an intermediate state, or near $\mu_i$ (i.e. stable umlaut). The distribution
of $C^t$ in an intermediate state often corresponds to Case 4 above: individual learners' means
are not tightly clustered around the population mean, but reflect the prior in the sense that
some individuals are stable near one endpoint ($c = \mu_a$) and some near the other ($c = \mu_i$),
corresponding to an empirical population in which a change has gone through for some
speakers but not for others.

In multiple-teacher scenarios (panels 3-4), the results are quite different. There is a
range of values of prior strength and channel bias which give stable contextual variation.
However, for a given $a$, as $\lambda$ is increased past a critical value, there is a rapid shift
of the population to a stable state where most learners have umlaut ($c \approx \mu_i$). That is,
there is a bifurcation where the strength of coarticulation has overcome the stabilizing effect
of the prior. When this happens, the population mean rapidly moves towards the other
category mean and stabilizes. Panels 3-4 also illustrate the tradeoff between categoricity
and channel biases: for a stronger prior, the critical value of $\lambda$ increases (i.e., the
degree of coarticulation needed to overcome the prior is greater).

Summary The complex prior model for multiple teachers meets all three of our modeling
goals: stability of contextual variation in the face of coarticulation; stability of umlaut
in the presence of categoricity bias; and rapid change in the population from stable
contextual variation to stable umlaut as system parameters ($a$, $\lambda$) are varied around
certain values.


6 Discussion 


This paper has explored how assumptions about channel bias, categoricity bias, and
population structure translate into population-level dynamics of a continuous parameter,
evaluating models by their ability to meet two goals reflecting empirical cases of sound
change: (1) the possibility of both stable contextual variation and change to stable umlaut,
each in the presence of forces promoting the other outcome, and (2) a nonlinear transition
from stable variation to sound change as a function of system parameters.

The first goal was met by all models where both a bias promoting change and a
bias promoting stability were present: in both simple and complex prior settings, stable
contextual variation can be maintained even in the presence of channel bias, and change
to stable umlaut can occur even in the presence of categoricity bias. This is an important
result, for two reasons related to the prevalence of both stable variation and sound change
in the world’s languages. First, it shows that it is possible to develop a model of sound
change involving channel bias that does not overapply (Baker, 2008; Baker et al., 2011;
Weinreich et al., 1968). Second, it shows that the distribution of a continuous parameter
in the population does not necessarily come to reflect the structure of learners’ hypothesis
space, when other forces (such as channel bias) are present. Convergence to the prior has
been emphasized in the cultural evolution literature (Griffiths and Kalish, 2007; Griffiths
et al., 2013; S. Kirby et al., 2007; Reali and Griffiths, 2009), and would not allow for the
possibility of both stable variation and change to stable umlaut, when the prior reflects
both possibilities. Instead, both stability and change are possible in a model where biases
promoting each outcome are both present.

Our second goal concerned how stability gives way to change as a function of the
relative strength of these biases. Models in which learners were equipped with a complex
prior showed a bifurcation: change from one stable state (contextual variation) to another
(umlaut) occurs suddenly as the relative strength of the biases favoring each stable state
is varied past a critical value, at which point ‘actuation’ can be said to have occurred.
Bifurcations in linguistic populations have been suggested as a key mechanism underlying
the actuation of linguistic change, but to our knowledge have previously only been shown
to occur in models of change involving discrete variants (Komarova et al., 2001; Niyogi,
2006; Niyogi and Berwick, 2009; Sonderegger and Niyogi, 2010). Our demonstration that
bifurcations are possible in a population of learners of a distribution of a continuous
parameter supports the hypothesis that bifurcations play a key role in the actuation of
language change more generally, and suggests that the ongoing empirical quantification
of forces corresponding to channel and categoricity biases will be crucial to a detailed
account of sound change actuation (Moreton, 2008; Sonderegger and Yu, 2010; Wilson,
2006).


Turning to the role of population structure, we observed significant differences between
single- and multiple-teacher settings. These differences are important given the prevalence
[...] 2013; Pierrehumbert, 2001; Wedel, 2006) [...] work on the evolution of discrete linguistic
traits (Burkett and Griffiths, 2010; Dediu, 2009; Niyogi and Berwick, 2009; Smith, 2009). For
naive learners, single-teacher scenarios result in ever-increasing population variance. In the
simple prior cases, convergence to a form reflecting the prior was seen in single-teacher
settings (Griffiths et al., 2013), but not in multiple-teacher settings. For single teachers in
the complex prior setting, the prior was reflected not in individuals’ distributions of the
learned phonetic parameter, but in the population-level mixture: rather than a majority of
individuals learning a phonetic parameter with a value intermediate between the prior
endpoints, individuals tended to learn a value at one endpoint or the other, with the
population consisting of a mixture of such individuals. This last result contrasts sharply
with abundant sociolinguistic evidence showing that the distribution of linguistic traits in
individuals tends to mirror that of their speech community (Fruehwald, 2013; Labov, 2010).
Conversely, the results from multiple-teacher settings are consistent with the finding that
social network ties can act as a conservative force promoting entrenchment (Milroy, 1980).
Overall, our results in single- versus multiple-teacher settings suggest that in addition to
categoricity bias, population structure itself can play a role in promoting stability of
existing phonetic categories.

While assuming one versus multiple teachers greatly affected the dynamics, it is important
to point out our potentially unintuitive finding that models assuming any number of
teachers greater than one resulted in very similar dynamics. Thus, exactly how population
structure affects the distribution of a linguistic parameter over time requires further study.
Given the crucial role that social networks play in the propagation of language change
(Labov, 2010; Milroy, 1980), we are currently extending this framework to handle different
population structures with more complex teacher-learner relations, including socially
stratified variation. Future work should also consider different types of biases promoting
stability and change, such as asymmetries in the extent of contact between members and
in the social weighting of groups and variants. These are some of many ways in which
our current framework can be extended to better match the complex reality of sound
change. However, even in the relatively simple model presented here, we have shown that
a solution to the actuation problem is possible: understanding why a language changes,
or fails to change, requires attention not only to those forces promoting change, but also to
their interplay with the forces constraining it.

Symbol         Meaning
$t$            generation
$n$            number of training examples
$m$            number of teachers
$k_j$          number of examples drawn from $j$th teacher
$y_i$          $i$th example
$\bar{y}$      mean of all examples
$\bar{y}_j$    mean of examples drawn from $j$th teacher
$c$            mean of F1 of $V_{12}$
$\lambda$      mean of channel bias
$\omega^2$     variance of channel bias
$\mu_a$        mean of F1 of $V_1$
$\sigma_a^2$   variance of F1 of $V_{12}$
$M$            size of population
$C^t$          random variable for contextual variant in generation $t$

Table 1: Notation.


Appendix A Model 

We first review the general setting presented in Model in the main paper. We assume the
terminology and notation introduced there, with some additions to be used in derivations
(summarized in Table 1).

Each generation at time $t$ consists of $M$ agents, who act as teachers for $M$ learners in
generation $t + 1$. Each learner receives $n$ examples, drawn from $m$ teachers, with values
$y = (y_1, \ldots, y_n)$. A new set of $m$ teachers from generation $t$ is drawn (with replacement)
for each learner in generation $t + 1$. For a given learner, which teacher the $i$th example
comes from is chosen randomly (each teacher has probability $1/m$), and $k_j$ denotes the
number of examples received from the $j$th teacher, $\bar{y}_j$ the mean F1 of the examples
received from the $j$th teacher, and $\bar{y}$ the mean F1 of all $n$ examples.

The distribution of $V_{12}$ for an agent with contextual variant $c$ is

$$V_{12} \sim N(c, \sigma_a^2) \tag{12}$$

The random variable $C^t$ corresponds to the contextual variant used by members of
generation $t$. We use lower-case $c$ to refer to draws from this random variable, at times with
subscripts ($c_j$ will refer to the contextual variant for the $j$th teacher) or a hat ($\hat{c}$ will
refer to an individual learner’s estimate of the contextual variant). All productions of
$V_{12}$ are subject to a channel bias with distribution $N(\lambda, \omega^2)$. Thus, F1 for a teacher
with contextual variant $c$ is distributed as

$$F1 \sim N(c - \lambda, \sigma_a^2 + \omega^2) \tag{13}$$

We assume that $M$ is very large ($M \to \infty$), in which case the evolution of the
distribution of $C^t$ is deterministic, and can be described by a dynamical system. Analyzing the
dynamical system under different assumptions lets us understand how different assumptions
about bias and population structure affect the population-level distribution of the
continuous phonetic parameter over time, analogously to existing dynamical systems models
of language change which consider discrete linguistic variants (e.g. Mitchener, 2003;
Niyogi, 2006; Niyogi and Berwick, 1997; Nowak et al., 2001).

It is worth briefly contrasting this setting with that considered in ‘iterated learning’
(IL) models which are common in the language evolution literature, where each generation
consists of a single member ($M = 1$) (e.g. Griffiths and Kalish, 2007; Griffiths et al., 2013;
S. Kirby et al., 2007; Reali and Griffiths, 2009; Smith et al., 2003). In IL models, the
state of the population is a stochastic process: it consists of a single value of $c$ at each
time point, and can be described as a discrete-time Markov chain $c^t$. IL studies generally
examine the evolution of this Markov chain: what would the distribution of values of $c^t$ be
if the chain were iterated a large number of times?⁷ At time $t$ in any given iteration, there
is only one value of $c$. In contrast, in our infinite-population setting we are examining the
evolution of $C^t$, i.e. the distribution of $c$ in the population at time $t$. In other words, we
are interested not in how a single parameter evolves (stochastically) over time, but in how
the distribution of this parameter in a population evolves (deterministically) over time.
A more detailed presentation of the difference between iterated learning and the ‘social
learning’ setting where $M = \infty$ is given by Niyogi and Berwick (2009).
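The transmission step described in this appendix can be sketched as a small finite-population simulation. This is only an illustration under stated assumptions, not the authors' implementation: the function name `next_generation`, the default learner rule (the sample mean, i.e. the naive ML learner of Appendix B), and all numerical values in the usage section are hypothetical choices made here. A finite $M$ only approximates the deterministic $M \to \infty$ dynamics.

```python
import random
import statistics


def next_generation(population, n, m, lam, omega, sigma_a, rng,
                    estimate=statistics.fmean):
    """Return the contextual variants learned by generation t + 1.

    population: list of contextual variants c used by generation t.
    estimate:   learner's rule mapping n examples to an estimate of c.
                The sample mean is an assumed default (naive ML learner).
    """
    learners = []
    for _ in range(len(population)):            # M learners for M teachers
        # A new set of m teachers is drawn with replacement for each learner.
        teachers = [rng.choice(population) for _ in range(m)]
        examples = []
        for _ in range(n):
            c = rng.choice(teachers)            # each teacher has probability 1/m
            production = rng.gauss(c, sigma_a)  # V_12 ~ N(c, sigma_a^2), Eq. 12
            bias = rng.gauss(lam, omega)        # channel bias ~ N(lambda, omega^2)
            examples.append(production - bias)  # F1 ~ N(c - lambda, ...), Eq. 13
        learners.append(estimate(examples))
    return learners


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative starting state and parameters (not taken from the paper).
    generation = [rng.gauss(730.0, 10.0) for _ in range(2000)]
    for t in range(5):
        print(t, statistics.fmean(generation), statistics.pvariance(generation))
        generation = next_generation(generation, n=20, m=2, lam=1.0,
                                     omega=5.0, sigma_a=30.0, rng=rng)
```

Passing a different `estimate` function (for example a learner with a prior) reuses the same population loop, which is why the learner rule is kept as a parameter.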


Appendix B Naive learning models 


Here we derive all analytical results referred to in Naive learning models in the main
paper.

We first consider maximum-likelihood (ML) learners who are “naive” in the sense of
having no prior over estimates of $c$. This setting is closely related to the ‘blending
inheritance’ models of cultural evolution of a quantitative character presented by Boyd and
Richerson (1985, 71ff), which we make use of below.


B.1 Naive learning models: single teacher

Each learner in generation $t+1$ is associated with a value $c$ (one draw from $C^t$, representing
the single teacher’s contextual variant), which is used to generate $n$ training examples for
that learner. Let $Y_1, \ldots, Y_n$ be the random variables corresponding to these examples,
which take on values $y = (y_1, \ldots, y_n)$, and let $\bar{y}$ be the mean of this sample. Each
example is normally distributed, following Eq. 2. Because $Y_1, \ldots, Y_n$ are independent and
normally distributed, their mean is also normally distributed, with the same mean and
reduced variance:

$$f_{\frac{Y_1 + \cdots + Y_n}{n}}(\bar{y} \mid C^t = c) = N_{\bar{y}}\left(c - \lambda, \frac{\sigma_a^2 + \omega^2}{n}\right) \tag{14}$$

Given $\bar{y}$, the learner’s ML estimate of the contextual variant, assuming the data was
generated by Eq. 12, is $\hat{c} = \bar{y}$. Thus, using Eq. 14, the distribution over values of
$\hat{c}$ the learner could acquire given $c$ is:

$$f_{C^{t+1}}(\hat{c} \mid C^t = c) = N_{\hat{c}}\left(c - \lambda, \frac{\sigma_a^2 + \omega^2}{n}\right) \tag{15}$$

that is, $\hat{c}$ is a noisy version of $c$, decreased by the mean channel bias ($\lambda$).

We are interested in the evolution of the distribution of $c$: that is, the distribution of
$C^{t+1}$ as a function of the distribution of $C^t$. It is not in general possible to analytically
derive what $f_{C^{t+1}}(c)$ is for an arbitrary $f_{C^t}(c)$. However, we can get a sense of the
evolution of the distribution of $c$ by examining how its mean and variance change over time.

7 Griffiths and Kalish (2007, pp. 470-471) do consider a continuous-time population-level model
as an extension of their discrete-time $M = 1$ models, corresponding to a continuous linear
dynamical system. However, the vast majority of IL studies assume discrete generations of size 1.

To do so, first consider the case where $\lambda = 0$. The learner’s estimate of $c$ can then be
written as

$$\hat{c} = \frac{1}{n}\sum_{i=1}^{n} (c_i + \epsilon_i) \tag{16}$$

where $c_i = c$ and $\epsilon_i \sim N(0, \sigma_a^2 + \omega^2)$. In this form, our setting can be
related directly to the classic ‘blending inheritance’ model of a quantitative character
(Boyd and Richerson, 1985, 71ff), where:

• A child in generation $t + 1$ takes the mean value of the character from $n$ cultural
parents (the $c_i$).

• Her observation of the $i$th cultural parent is distorted by a noise term ($\epsilon_i$).

• The distribution of $C^t$ is the distribution over cultural parents in generation $t$.


Having made this equivalence, Eqs. 3.21 and 3.22 of Boyd and Richerson (1985) (rewritten
using our notation) give the evolution of the mean and variance of $C^t$:

$$E[C^{t+1}] = E[C^t] \tag{17}$$

$$\mathrm{Var}[C^{t+1}] = \sum_{i=1}^{n} \frac{1}{n^2}\left(\mathrm{Var}[C^t] + \sigma_a^2 + \omega^2\right) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{1}{n^2}\left(\mathrm{Cov}(\epsilon_i, c_j) + \mathrm{Var}[C^t] \cdot \mathrm{Corr}(c_i, c_j)\right) \tag{18}$$

In our case, $\mathrm{Cov}(\epsilon_i, c_j) = 0$ (because $\epsilon_i$ and $c_j$ are independent) and
$\mathrm{Corr}(c_i, c_j) = 1$ (because all the $c_i$ have the same value). After some algebra,
Eq. 18 thus simplifies to

$$\mathrm{Var}[C^{t+1}] = \mathrm{Var}[C^t] + \frac{\sigma_a^2 + \omega^2}{n} \tag{19}$$

Eq. 17 and Eq. 19 describe the evolution of the mean and variance of $C^t$ when $\lambda = 0$.
In the case where $\lambda > 0$, the learner’s estimate of $c$ changes by subtracting the constant
$\lambda$ from the right-hand side of Eq. 16, which entails subtracting $\lambda$ from the right-hand
side of Eq. 17 and 0 from the right-hand side of Eq. 19.⁸ The evolution of the mean and variance
of $C^t$ is thus

$$E[C^{t+1}] = E[C^t] - \lambda \tag{20}$$

$$\mathrm{Var}[C^{t+1}] = \mathrm{Var}[C^t] + \frac{\sigma_a^2 + \omega^2}{n} \tag{21}$$

which are Eqs. 3-4 in the main paper.


We note that although results from Boyd and Richerson (1985) were used to derive
these evolution equations, the result that the variance of $c$ increases without bound over
time (Eq. 21, illustrated in Fig. 3, panel 1) contrasts with their well-known finding
that blending inheritance in general reduces variance of a quantitative trait over time,
as emphasized in their discussion (p. 75). However, stable or increasing variance is
possible for particular cases of Boyd and Richerson’s model, such as the case considered
here where each learner has a single cultural parent and there is noise in estimating the
parent’s cultural model.
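The single-teacher evolution equations can be iterated directly to see the linear drift of the mean and the unbounded linear growth of the variance. The sketch below simply applies Eqs. 20-21 generation by generation; the function name and the parameter values in the usage note are illustrative assumptions, not values from the paper.

```python
def single_teacher_moments(mean0, var0, lam, sigma_a, omega, n, generations):
    """Iterate Eqs. 20-21 and return [(mean, variance)] for each generation."""
    noise = (sigma_a ** 2 + omega ** 2) / n
    mean, var = mean0, var0
    trajectory = [(mean, var)]
    for _ in range(generations):
        mean -= lam      # Eq. 20: the mean drops by lambda each generation
        var += noise     # Eq. 21: the variance grows by (sigma_a^2 + omega^2) / n
        trajectory.append((mean, var))
    return trajectory
```

For instance, `single_teacher_moments(730.0, 100.0, 0.5, 30.0, 5.0, 20, 10)` returns eleven (mean, variance) pairs; after $T$ generations the mean is `mean0 - T * lam` and the variance is `var0 + T * (sigma_a**2 + omega**2) / n`, so neither quantity settles down when $\lambda > 0$.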


8 Because if $X$ is a random variable and $a$ is a constant, $E[X - a] = E[X] - a$ and $\mathrm{Var}[X - a] = \mathrm{Var}[X]$.

B.2 Naive learning models: multiple teachers

We now consider the case where each learner in generation $t + 1$ receives $n$ examples from
$m > 1$ teachers in generation $t$. That is, values $c_1, \ldots, c_m$, corresponding to the $m$
teachers, are drawn i.i.d. from $C^t$, and the teacher who generates each example is chosen
randomly (with replacement). We assume that $n > 1$.⁹ Let $k_j$ denote the number of examples
drawn from the $j$th teacher ($k_1 + \cdots + k_m = n$), and let $k = (k_1, \ldots, k_m)$. Thus, $k$
follows a multinomial distribution with $n$ trials and event probabilities
$p_1, p_2, \ldots, p_m = 1/m$. Without loss of generality, we can assume that examples
$1, \ldots, k_1$ come from teacher 1, examples $k_1 + 1, \ldots, k_1 + k_2$ come from teacher 2, and
so on. Let $\bar{y}_j$ denote the mean of the examples from the $j$th teacher. The learner’s ML
estimate, $\hat{c} = \bar{y}$, can then be rewritten as:

$$\hat{c} = \bar{y} = \frac{1}{n}(y_1 + \cdots + y_n) = \frac{1}{n}\sum_{j=1}^{m} k_j \bar{y}_j \tag{22}$$

Note that conditional on $k_j$, each $\bar{y}_j$ can be thought of as the learner’s ML estimate of
$c_j$ using $k_j$ examples from a single teacher. Thus, by the same logic used to derive
Eqs. 17-19, we have

$$E[\bar{y}_j] = E[C^t] - \lambda \tag{23}$$

$$\mathrm{Var}[\bar{y}_j] = \mathrm{Var}[C^t] + \frac{\sigma_a^2 + \omega^2}{k_j} \tag{24}$$

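Because $\hat{c}$ is drawn from $C^{t+1}$, taking the expectation of Eq. 22 and substituting in
Eq. 23 gives:

$$\begin{aligned}
E[C^{t+1}] = E[\hat{c}] &= E\left[\frac{1}{n}\sum_{j=1}^{m} k_j \bar{y}_j\right] \\
&= \sum_{k} P(k)\, E\left[\frac{1}{n}\sum_{j=1}^{m} k_j \bar{y}_j \,\middle|\, k\right] \\
&= \sum_{k} P(k)\, \frac{1}{n} \underbrace{\left(\sum_{j=1}^{m} k_j\right)}_{=n} \left(E[C^t] - \lambda\right) \\
&= \sum_{k} P(k) \left(E[C^t] - \lambda\right) \\
&= E[C^t] - \lambda
\end{aligned} \tag{25}$$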

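9 If $n = 1$, the multiple-teacher case is the same as the single-teacher case already considered.

Similarly, taking the variance of Eq. 22 gives:

$$\mathrm{Var}[C^{t+1}] = \mathrm{Var}[\hat{c}] = \mathrm{Var}\left[\sum_{j=1}^{m} \frac{k_j}{n}\bar{y}_j\right] \tag{26}$$

$$\begin{aligned}
&= E_k\left[\mathrm{Var}\left[\sum_{j=1}^{m} \frac{k_j}{n}\bar{y}_j \,\middle|\, k\right]\right] + \mathrm{Var}_k\left[E\left[\sum_{j=1}^{m} \frac{k_j}{n}\bar{y}_j \,\middle|\, k\right]\right] \quad \text{(law of total variance)} \\
&= E_k\left[\sum_{j=1}^{m} \frac{k_j^2}{n^2}\mathrm{Var}[\bar{y}_j]\right] + \underbrace{\mathrm{Var}_k\left[E[C^t] - \lambda\right]}_{=0} \\
&= \sum_{j=1}^{m} E\left[\frac{k_j^2}{n^2}\left(\mathrm{Var}[C^t] + \frac{\sigma_a^2 + \omega^2}{k_j}\right)\right] \quad \text{(by Eq. 24)} \\
&= \frac{\mathrm{Var}[C^t]}{n^2}\sum_{j=1}^{m} E[k_j^2] + \frac{\sigma_a^2 + \omega^2}{n}\underbrace{\sum_{j=1}^{m} \frac{E[k_j]}{n}}_{=1}
\end{aligned} \tag{27}$$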


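Using the expressions for $E[k_j]$ and $\mathrm{Var}[k_j]$ for a multinomial distribution (where $p_j$
is the probability of the $j$th outcome):

$$E[k_j^2] = \underbrace{E[(k_j - E[k_j])^2]}_{\mathrm{Var}[k_j] \,=\, n p_j (1 - p_j) \,=\, n \frac{1}{m}\left(1 - \frac{1}{m}\right)} + \underbrace{E[k_j]^2}_{(n p_j)^2 \,=\, \frac{n^2}{m^2}} = \frac{nm - n + n^2}{m^2} \tag{28}$$

Substituting into Eq. 27 gives

$$\begin{aligned}
\mathrm{Var}[C^{t+1}] &= \frac{\mathrm{Var}[C^t]}{n^2} \cdot m \cdot \frac{nm - n + n^2}{m^2} + \frac{\sigma_a^2 + \omega^2}{n} \\
&= \frac{\sigma_a^2 + \omega^2}{n} + \mathrm{Var}[C^t]\,\frac{n + m - 1}{nm}
\end{aligned} \tag{29}$$

The evolution equations of the mean and variance are then

$$E[C^{t+1}] = E[C^t] - \lambda \tag{30}$$

$$\mathrm{Var}[C^{t+1}] = \frac{\sigma_a^2 + \omega^2}{n} + \mathrm{Var}[C^t]\,\frac{n + m - 1}{nm} \tag{31}$$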


Thus, the mean of c always decreases without bound, as in the single-teacher case (Eq. 
20), regardless of the number of teachers or the number of examples. 
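The multinomial second moment used in the variance derivation can be checked exactly. The sketch below relies on the standard fact that each count $k_j$ of a multinomial with equal probabilities $1/m$ is marginally Binomial($n$, $1/m$), and enumerates that distribution with exact rational arithmetic. The function names are introduced here for illustration only.

```python
from fractions import Fraction
from math import comb


def second_moment_kj(n, m):
    """E[k_j^2] for k_j ~ Binomial(n, 1/m), computed by exact enumeration."""
    p = Fraction(1, m)
    return sum(Fraction(k * k) * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))


def variance_coefficient(n, m):
    """Coefficient multiplying Var[C^t]: (1 / n^2) * sum over j of E[k_j^2]."""
    return m * second_moment_kj(n, m) / n ** 2
```

For any $n$ and $m$, `second_moment_kj(n, m)` should equal `Fraction(n*m - n + n*n, m*m)` and `variance_coefficient(n, m)` should equal `Fraction(n + m - 1, n*m)`, matching the closed forms in the derivation above.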

Turning to the variance, define $B = \frac{nm}{n + m - 1}$. Because $m > 1$ and $n > 1$ (by
assumption):

$$(n - 1)(m - 1) > 0 \implies n + m - 1 < nm \implies B > 1 \tag{32}$$

The variance evolution equation is an iterated map of the form

$$x_{t+1} = K_1 + x_t / B \tag{33}$$

where $K_1$ and $B$ are constants. Because $|B| > 1$, the map has a unique fixed point $\alpha^*$
which it converges to from any starting point (Hirsch et al., 2004). In particular, letting
$\mathrm{Var}[C^1]$ be the variance of $c$ in generation 1, we can rewrite the variance evolution
equation (Eq. 31) as

$$\mathrm{Var}[C^t] = \alpha^* + \frac{\mathrm{Var}[C^1] - \alpha^*}{B^{t-1}} \tag{34}$$

where

$$\alpha^* = \frac{m}{m - 1} \cdot \frac{\sigma_a^2 + \omega^2}{n - 1} \tag{35}$$

is the fixed point.

Thus, for multiple teachers, the variance quickly converges to a fixed point $\alpha^*$, with
its distance from $\alpha^*$ decreasing geometrically (Eq. 34, illustrated in Fig. 3, panels 2-3).
The value of the stable variance decreases as the number of examples ($n$) or the number
of teachers ($m$) is increased. For example, for the two-teacher case ($m = 2$), Eq. 34 gives

$$\mathrm{Var}[C^t] \to \frac{2(\sigma_a^2 + \omega^2)}{n - 1} \quad \text{(two teachers)} \tag{36}$$

which is Eq. 5 in the main paper. For the case where learners learn from all teachers
($m = M$), in the limit considered in our setting where $M \to \infty$, Eq. 34 gives

$$\mathrm{Var}[C^t] \to \frac{\sigma_a^2 + \omega^2}{n - 1} \quad \text{(all teachers)} \tag{37}$$

which is Eq. 6 in the main paper.
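The convergence of the multiple-teacher variance can be checked numerically by iterating the map and comparing it with the closed-form solution and fixed point. This is a minimal sketch: the function names are hypothetical, and the formulas are the variance evolution equation, the fixed point $\alpha^* = \frac{m}{m-1}\cdot\frac{\sigma_a^2+\omega^2}{n-1}$, and the geometric solution in terms of $B = nm/(n+m-1)$ derived above.

```python
def variance_step(var, n, m, sigma_a, omega):
    """One generation of the multiple-teacher variance map (form of Eq. 33)."""
    return (sigma_a ** 2 + omega ** 2) / n + var * (n + m - 1) / (n * m)


def stable_variance(n, m, sigma_a, omega):
    """Fixed point alpha* of the map; requires n > 1 and m > 1."""
    return m * (sigma_a ** 2 + omega ** 2) / ((m - 1) * (n - 1))


def variance_at(t, var1, n, m, sigma_a, omega):
    """Closed-form variance in generation t >= 1, starting from var1."""
    B = n * m / (n + m - 1)
    alpha = stable_variance(n, m, sigma_a, omega)
    return alpha + (var1 - alpha) / B ** (t - 1)
```

Applying `variance_step` repeatedly from `var1` should reproduce `variance_at`, and for $m = 2$ the value of `stable_variance` reduces to $2(\sigma_a^2 + \omega^2)/(n - 1)$, while for very large $m$ it approaches $(\sigma_a^2 + \omega^2)/(n - 1)$.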


Appendix C Simple prior models 


Here we derive all analytical results referred to in Simple prior models in the main paper. 

In these models, we again assume (as in the naive learning models) that a learner in
generation $t + 1$ estimates the mean of the contextual variant based on the assumption
that her data (from generation $t$) is generated i.i.d. from a gaussian source with a fixed $c$
(Eq. 12). However, we now assume that she has a gaussian prior over how likely different
values of $c$ are:

$$f_{C^{t+1}}(c) = N_c(\mu_a, \tau^2) \tag{38}$$

which is updated to a posterior distribution based on the data ($f_{C^{t+1}}(c \mid Y = y)$).¹⁰

In this setting, the learner can be seen as performing a particularly simple case of
Bayesian linear regression (see e.g. Bishop, 2006), where she is finding the constant ($c$)
that best matches the mean of the data ($y$) in the least-squares sense, and there is a
gaussian prior on possible values of $c$. The gaussian prior is the conjugate prior, so the
posterior distribution of $c$ is gaussian as well. Using Eq. 3.49-3.51 from Bishop (2006),
the posterior can be shown to be:¹¹

$$f_{C^{t+1}}(c \mid Y = y) = N_c\left(\frac{\bar{y} + \mu_a \frac{\sigma_a^2}{n\tau^2}}{1 + \frac{\sigma_a^2}{n\tau^2}},\ \frac{\sigma_a^2}{n}\cdot\frac{1}{1 + \frac{\sigma_a^2}{n\tau^2}}\right) \tag{39}$$

10 This learning algorithm is similar to that considered by Griffiths et al. (2013) in a study of
the evolution of a continuous parameter, but their iterated learning setting (where $M = 1$)
differs from the population setting considered here (where $M$ is large), as discussed above. We
compare our results to theirs below.

The learner must pick a point estimate of the contextual variant to use for generating
training data for the next generation. The two common ways of obtaining a point
estimate from a posterior distribution are taking the maximum a-posteriori value or the
expected value. These are equivalent for Eq. 39 (because the mean and mode of a normal
distribution are identical), namely:

$$\hat{c} = \frac{\bar{y} + (D - 1)\mu_a}{D} \tag{40}$$

where $D$ abbreviates the denominator of the posterior mean in Eq. 39:

$$D = 1 + \frac{\sigma_a^2}{n\tau^2} \tag{41}$$


Using the same notation as above (Table 1), we now determine the evolution of the
mean and variance of $C^t$ for a population of simple prior learners whose estimate of $c$
is given by Eq. 40. To reduce the number of cases which need to be considered below,
we assume that $n > 1$, $\sigma_a > 0$, and $\tau > 0$: that is, each learner receives more than
one example, there is some variability among a speaker’s productions of $V_{12}$, and the
categoricity prior is not infinitely strong.
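The point estimate in Eq. 40 is a shrinkage of the sample mean towards the prior mean, and it coincides with the usual precision-weighted posterior mean for a gaussian prior and gaussian likelihood. The sketch below checks that equivalence numerically; the function names and the example values are assumptions made for illustration.

```python
def shrinkage_D(n, sigma_a, tau):
    """D = 1 + sigma_a^2 / (n * tau^2), as in Eq. 41."""
    return 1 + sigma_a ** 2 / (n * tau ** 2)


def point_estimate(y_bar, mu_a, n, sigma_a, tau):
    """Learner's point estimate in the form of Eq. 40."""
    D = shrinkage_D(n, sigma_a, tau)
    return (y_bar + (D - 1) * mu_a) / D


def precision_weighted_mean(y_bar, mu_a, n, sigma_a, tau):
    """Standard conjugate-gaussian posterior mean, for comparison."""
    prior_precision = 1 / tau ** 2       # precision of the N(mu_a, tau^2) prior
    data_precision = n / sigma_a ** 2    # precision of the mean of n examples
    return ((prior_precision * mu_a + data_precision * y_bar)
            / (prior_precision + data_precision))
```

With, say, `y_bar=720.0, mu_a=730.0, n=20, sigma_a=30.0, tau=10.0`, both functions return the same value, which lies between the sample mean and the prior mean. A smaller `tau` (stronger prior) gives a larger `D` and pulls the estimate closer to `mu_a`.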


C.1 Simple prior models: single teacher

For the case where $m = 1$, the distribution of $\bar{y}$ is still given by Eq. 14, where $c$ is the
value of the contextual variant used by the single teacher. Because $\mu_a$ and $D$ are constants,
using Eq. 40, the distribution of $\hat{c}$ is then

$$f_{C^{t+1}}(\hat{c} \mid C^t = c) = N_{\hat{c}}\left(\frac{c - \lambda + (D - 1)\mu_a}{D},\ \frac{\sigma_a^2 + \omega^2}{nD^2}\right) \tag{42}$$


Examining Eq. 15, we see that if $X$ denotes the estimate of $c$ (given the teacher’s value
of $c$) in the single-teacher naive learner case, then $\hat{c}$ for the current case is simply $X$
translated and divided by constants: $(X + (D - 1)\mu_a)/D$. Thus, Eq. 20 can be used to
find the evolution of the mean:

$$E[C^{t+1}] = E[\hat{c}] = E\left[\frac{X + (D - 1)\mu_a}{D}\right] = \frac{1}{D}\left(E[C^t] - \lambda + (D - 1)\mu_a\right) \tag{43}$$

and Eq. 21 can be used to find the evolution of the variance:

$$\mathrm{Var}[C^{t+1}] = \mathrm{Var}[\hat{c}] = \mathrm{Var}\left[\frac{X + (D - 1)\mu_a}{D}\right] = \frac{\mathrm{Var}[X]}{D^2} = \frac{\sigma_a^2 + \omega^2}{nD^2} + \frac{\mathrm{Var}[C^t]}{D^2} \tag{44}$$


11 In particular, by making these substitutions into Eq. 3.49-3.51: $\mathbf{w} = (c)$, $\mathbf{S}_0 = (\tau^2)$, $\beta = 1/\sigma_a^2$, $\Phi = (1, \ldots, 1)^T$ ($n$ times), $\mathbf{m}_0 = \mu_a$.

Now, note that the assumption that $\sigma_a, \tau > 0$ means that $D > 1$, so that both Eq. 43
and Eq. 44 are iterated maps of the form in Eq. 33 with $|B| > 1$. These maps have unique
stable fixed points; thus, both the mean and the variance of $C^t$ rapidly converge to fixed
points from any starting values. Solving for the fixed points gives:¹²

$$E[C^t] \to \mu_a - \frac{\lambda n \tau^2}{\sigma_a^2} \tag{45}$$

$$\mathrm{Var}[C^t] \to \frac{\tau^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{2 + \frac{\sigma_a^2}{n\tau^2}} \tag{46}$$

Eq. 45 is Eq. 7 in the main paper. The mean of the contextual variant in the population
converges to the value favored by the prior minus an offset which depends on $\lambda$, $n$,
$\tau$, $\sigma_a$ in intuitive directions: stronger net channel bias ($\lambda$) over the $n$ examples
results in lower $\hat{c}$, while stronger categoricity bias relative to the amount of production
variability ($\tau/\sigma_a$) results in $\hat{c}$ nearer to $\mu_a$.

We discuss the expression for the stable variance below, along with the equivalent
expression for the multiple-teacher case.
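The fixed points of the single-teacher simple prior model can be verified by iterating the evolution equations for the mean and variance until they stop changing. The sketch below does this and compares the result with the closed forms $\mu_a - \lambda n \tau^2/\sigma_a^2$ and $\tau^2(1 + \omega^2/\sigma_a^2)/(2 + \sigma_a^2/(n\tau^2))$. Function names and parameter values are illustrative assumptions.

```python
def iterate_simple_prior(mean, var, lam, mu_a, sigma_a, omega, tau, n, generations):
    """Iterate the single-teacher simple prior maps (Eqs. 43-44)."""
    D = 1 + sigma_a ** 2 / (n * tau ** 2)
    for _ in range(generations):
        mean = (mean - lam + (D - 1) * mu_a) / D                       # mean map
        var = (sigma_a ** 2 + omega ** 2) / (n * D ** 2) + var / D ** 2  # variance map
    return mean, var


def simple_prior_fixed_points(lam, mu_a, sigma_a, omega, tau, n):
    """Closed-form stable mean and variance (Eqs. 45-46)."""
    mean_star = mu_a - lam * n * tau ** 2 / sigma_a ** 2
    var_star = (tau ** 2 * (1 + omega ** 2 / sigma_a ** 2)
                / (2 + sigma_a ** 2 / (n * tau ** 2)))
    return mean_star, var_star
```

For example, with `lam=1.0, mu_a=730.0, sigma_a=50.0, omega=5.0, tau=10.0, n=100` the stable mean is 726, four units below the prior mean, and the iteration reaches it from any starting mean because each step shrinks the distance to the fixed point by a factor of $1/D$.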


C.1.1 Comparison with previous work

Because our single-teacher simple prior scenario is particularly close to one of the iterated
learning scenarios considered by Griffiths et al. (2013), it is worth comparing our results to
theirs to see to what extent they diverge.¹³ Individual learners in their ‘category defined
on a single dimension’ setting (pp. 954-956) learn in essentially the same way as our
single-teacher simple prior learners, except that no production bias is applied. In addition, each
generation consists of one teacher/learner ($M = 1$), compared to our $M = \infty$. Thus, the
value of $c$ at each time point is a Markov chain, which we write as $c^t$. In our notation,
Griffiths et al. show that (p. 966, Eq. 11)

$$c^t \mid c^1 \sim N\left(\mu_a + \frac{c^1 - \mu_a}{D^{t-1}},\ \tau^2\left(1 + \frac{\sigma_a^2}{n\tau^2}\right)\left(1 - D^{-2(t-1)}\right)\right) \tag{47}$$


Although each generation in an iterated learning model consists of only one agent, there
is a natural interpretation of $c^t$ in a population context (where $M = \infty$) as describing the
distribution of $C^t$ in a population of teacher/learners, each of whom learns from exactly
one agent (in generation $t - 1$) and teaches exactly one agent (in generation $t$) (as pointed
out by Griffiths et al., 2013; Niyogi and Berwick, 2009). (In other words, the population
consists of an infinite number of iterated-learning chains run in parallel.) This setting is
slightly different from our single-teacher case (Fig. 1, left panel in the main text), where two
members of generation $t$ could have the same teacher (and some members of generation
$t - 1$ might never serve as teachers). How does this slight difference affect the dynamics?
We can compare the stable state of the distribution of $C^t$ in the two cases by setting
$\lambda = 0$, $\omega = 0$ in Eqs. 45-46 and taking the limit $t \to \infty$ in Eq. 47:

$$E[C^t] \to \mu_a \quad \text{(both models)} \tag{48}$$

$$\mathrm{Var}[C^t] \to \begin{cases} \dfrac{\tau^2}{2 + \frac{\sigma_a^2}{n\tau^2}} & \text{(our setting)} \\[2ex] \tau^2\left(1 + \dfrac{\sigma_a^2}{n\tau^2}\right) & \text{(iterated learning)} \end{cases} \tag{49}$$

12 E.g. by setting $E[C^{t+1}]$ and $E[C^t]$ to $x$ in Eq. 43 and solving for $x$.

13 Griffiths et al. in fact mention ‘the value of a specific formant of a phoneme’ as a motivating case (p. 955).

Thus, the distribution of the coarticulation parameter comes to reflect the prior in both
cases: the mean converges to the mean of the prior (in both models), while the variance
converges to a value related to $\tau^2$ (the width of the prior), but which is smaller in our
model than in the iterated learning model, by at least a factor of 2 (depending on the
values of $\tau$, $\sigma_a$, and $n$). The long-term dynamics are therefore similar in the two models,
but slightly different.
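The "factor of 2" can be made explicit with one line of algebra. The step below is a sketch that assumes the two limiting variances are $\tau^2/(2 + s)$ for our setting (Eq. 46 with $\lambda = 0$, $\omega = 0$) and $\tau^2(1 + s)$ for iterated learning (Eq. 47 as $t \to \infty$); the shorthand $s$ and the subscripted variance labels are introduced here and are not notation from the paper.

```latex
% s is shorthand introduced for this step only.
s = \frac{\sigma_a^2}{n\tau^2} > 0, \qquad
\frac{\mathrm{Var}_{\mathrm{IL}}}{\mathrm{Var}_{\mathrm{ours}}}
  = \frac{\tau^2 (1 + s)}{\tau^2 / (2 + s)}
  = (1 + s)(2 + s)
  = 2 + 3s + s^2 > 2 .
```

Under these assumptions the ratio approaches 2 when $s$ is small (many examples, or a wide prior relative to production variability) and grows as $s$ increases.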


C.2 Simple prior models: multiple teachers 

When $m > 1$, we can proceed similarly to the naive-learner multiple-teacher case, defining
$k_j$, $\bar{y}_j$, etc. in the same way. The learner’s point estimate of $c$ is still given by Eq. 40,
which can be used to rewrite $\hat{c}$ as:

$$\hat{c} = \frac{(D - 1)\mu_a}{D} + \frac{\bar{y}}{D} = \frac{(D - 1)\mu_a}{D} + \frac{1}{nD}(y_1 + \cdots + y_n) = \frac{(D - 1)\mu_a}{D} + \frac{1}{nD}\sum_{j=1}^{m} k_j \bar{y}_j \tag{50}$$

As in Naive learning models: single teacher, $\bar{y}_j$ can be thought of as the ML estimate
made by a naive learner (in generation $t + 1$) based on drawing $k_j$ examples from a single
teacher in generation $t$. Also, note that $D$ does not depend on any $k_j$. Because $\hat{c}$ is drawn
from $C^{t+1}$, taking the expectation of both sides of Eq. 50 gives:


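$$\begin{aligned}
E[C^{t+1}] = E[\hat{c}]
&= \frac{(D-1)\mu_a}{D} + \frac{1}{nD}\, E\Big[\sum_{j=1}^{m} k_j \bar{y}_j\Big] \\
&= \frac{(D-1)\mu_a}{D} + \frac{1}{D}\, E_k\Big[\, E\Big[\frac{1}{n}\sum_{j=1}^{m} k_j \bar{y}_j \,\Big|\, k_1, \ldots, k_m\Big]\Big] \\
&= \frac{(D-1)\mu_a}{D} + \frac{1}{D}\big(E[C^t] - \lambda\big) \qquad \text{(by Eq. 23)} \\
&= \frac{E[C^t] - \lambda + (D-1)\mu_a}{D}
\end{aligned} \tag{51}$$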


Thus, the evolution of the mean in the multiple-teacher case (Eq. 51) is the same as in
the single-teacher case (Eq. 43). In particular, the mean converges to the value given in
Eq. 45, which gives Eq. 7 in the main paper.
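
The convergence of the mean can be checked numerically. The following Python sketch iterates the mean update of Eq. 51 and compares the result with the fixed point obtained by setting $E[C^{t+1}] = E[C^t] = x$ and solving for $x$, which gives $x = \mu_a - \lambda/(D - 1)$. The function names and the numerical values (in particular the value of $D$, which is simply taken to be some number greater than 1) are illustrative assumptions, not values from the paper.

```python
def mean_step(mean_t, lam, mu_a, D):
    """Apply the mean update of Eq. 51 once: E[C^{t+1}] from E[C^t]."""
    return (mean_t - lam + (D - 1.0) * mu_a) / D


def mean_fixed_point(lam, mu_a, D):
    """Solve x = (x - lam + (D - 1) * mu_a) / D for x (requires D > 1)."""
    return mu_a - lam / (D - 1.0)


def iterate_mean(mean_0, lam, mu_a, D, generations):
    """Iterate the mean update for a given number of generations."""
    mean_t = mean_0
    for _ in range(generations):
        mean_t = mean_step(mean_t, lam, mu_a, D)
    return mean_t


if __name__ == "__main__":
    # Illustrative values only: D is a hypothetical constant greater than 1.
    mu_a, lam, D = 730.0, 0.5, 1.05
    print(iterate_mean(725.0, lam, mu_a, D, 500))
    print(mean_fixed_point(lam, mu_a, D))
```

Because each step shrinks the distance to the fixed point by a factor of $1/D$, the iteration converges from any starting mean whenever $D > 1$; with $\lambda = 0$ the fixed point is the prior mean $\mu_a$.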


Similarly, taking the variance of Eq. 50 gives

$$\mathrm{Var}[C^{t+1}] = \mathrm{Var}[\hat{c}]
= \mathrm{Var}\Big[\frac{(D-1)\mu_a}{D} + \frac{1}{nD}\sum_{j=1}^{m} k_j \bar{y}_j\Big] \tag{52}$$

$$= \frac{1}{D^2}\, \underline{\mathrm{Var}\Big[\sum_{j=1}^{m} \frac{k_j}{n}\, \bar{y}_j\Big]} \tag{53}$$

The underlined term is the same as Eq. 26 in the naive learner multiple-teacher case, and
its derivation proceeds identically from that point on (up to Eq. 29), to give


$$\mathrm{Var}[C^{t+1}] = \frac{\sigma_a^2 + \omega^2}{nD^2} + \frac{n + m - 1}{nmD^2}\, \mathrm{Var}[C^t] \tag{54}$$


Recall that for multiple teachers ($m > 1$), provided that $n > 1$ (which is true, by assumption),
we have that $|(n + m - 1)/nm| < 1$ (Eq. 32). Thus, because $D > 1$ as well (Eq. 41),
the evolution equation for the variance (Eq. 54) is an iterated map of the form in Eq. 33
with $|B| < 1$, which has a unique stable fixed point. Solving for it gives:


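$$\mathrm{Var}[C^{\infty}] = \frac{\tau^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{\frac{(n-1)(m-1)}{m}\,\frac{\tau^2}{\sigma_a^2} + \left(2 + \frac{\sigma_a^2}{n\tau^2}\right)} \tag{55}$$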
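
As a rough numerical check on the fixed-point argument, the Python sketch below iterates an affine variance map $\mathrm{Var}[C^{t+1}] = A + B\,\mathrm{Var}[C^t]$ and compares the limit with the fixed point $A/(1 - B)$. It assumes $B = (n + m - 1)/(nmD^2)$, as implied by the contraction argument above, and $A = v/(nD^2)$, where $v$ is a stand-in for the per-example production variance. The symbol $v$, the function names, and all numerical values are assumptions made for illustration.

```python
def contraction_factor(n, m, D):
    """Coefficient B multiplying Var[C^t] in the assumed variance map."""
    return (n + m - 1) / (n * m * D ** 2)


def variance_step(var_t, v, n, m, D):
    """One step of the assumed map Var[C^{t+1}] = A + B * Var[C^t]."""
    A = v / (n * D ** 2)
    return A + contraction_factor(n, m, D) * var_t


def stable_variance(v, n, m, D):
    """Fixed point A / (1 - B); only meaningful when B < 1."""
    A = v / (n * D ** 2)
    return A / (1.0 - contraction_factor(n, m, D))


def iterate_variance(var_0, v, n, m, D, generations):
    """Iterate the variance map for a given number of generations."""
    var_t = var_0
    for _ in range(generations):
        var_t = variance_step(var_t, v, n, m, D)
    return var_t


if __name__ == "__main__":
    # Illustrative values only (v and D are hypothetical).
    n, m, D, v = 100, 2, 1.0625, 2500.0
    print(iterate_variance(100.0, v, n, m, D, 200))
    print(stable_variance(v, n, m, D))
```

With $m = 1$ and $D = 1$ the coefficient $B$ equals 1, so the map no longer contracts; the fixed point exists in the multiple-teacher case because $(n + m - 1)/nm < 1$, and in the simple prior case additionally because $D > 1$.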


Comparing Eq. 55 with Eq. 46, we see that the stable variance decreases monotonically
as the number of teachers ($m$) is increased, when all other parameters are held constant.
(This fact is referred to in Naive learning models in the main paper.) In particular, the
stable variances for the three values of $m$ considered in the main paper ($1, 2, \infty$) are:


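$$\mathrm{Var}[C^{\infty}] = \frac{\tau^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{2 + \frac{\sigma_a^2}{n\tau^2}} \quad \text{(1 teacher)} \tag{56}$$

$$\mathrm{Var}[C^{\infty}] = \frac{\tau^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{\frac{(n-1)}{2}\,\frac{\tau^2}{\sigma_a^2} + \left(2 + \frac{\sigma_a^2}{n\tau^2}\right)} \quad \text{(2 teachers)} \tag{57}$$

$$\mathrm{Var}[C^{\infty}] = \frac{\tau^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{(n-1)\,\frac{\tau^2}{\sigma_a^2} + \left(2 + \frac{\sigma_a^2}{n\tau^2}\right)} \quad \text{(all teachers)} \tag{58}$$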


These expressions for the stable variance are hard to understand intuitively. We can 
get a sense of their behavior by taking n to be large, in accordance with the intuition that 
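each learner will receive many examples of a given phonetic category. Taking the Taylor
expansions of Eqs. 56–58 in terms of $1/n$ gives:

$$\mathrm{Var}[C^{\infty}] = \left(1 + \frac{\omega^2}{\sigma_a^2}\right)\left(\frac{\tau^2}{2} - \frac{\sigma_a^2}{4n}\right) + O\!\left(\frac{1}{n^2}\right) \quad \text{(1 teacher)} \tag{59}$$

$$\mathrm{Var}[C^{\infty}] = \frac{2\sigma_a^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{n} + O\!\left(\frac{1}{n^2}\right) \quad \text{(2 teachers)} \tag{60}$$

$$\mathrm{Var}[C^{\infty}] = \frac{\sigma_a^2\left(1 + \frac{\omega^2}{\sigma_a^2}\right)}{n} + O\!\left(\frac{1}{n^2}\right) \quad \text{(all teachers)} \tag{61}$$

where $O(1/n^2)$ denotes a constant divided by $n^2$. (These are Eqs. 8–10 in the main paper.)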
Thus, there are two important differences between the form of the stable variance for
$m = 1$ and $m > 1$:

• First, the stable variance for $m = 1$ always reflects the prior (in the sense that the
expression involves $\tau$), for any $n$, while the stable variance for $m > 1$ does not ($\tau$
only enters into second-order terms).

• Second, the stable variance for $m > 1$ goes to 0 as $n$ is increased, while the stable
variance for $m = 1$ goes to a constant value which reflects the prior. Thus, a
population of simple prior learners who receive many examples would eventually
show no variability in their contextual variants (values of $c$) for two or more teachers,
while the same population with single-teacher learning would show variability in their
contextual variants.

14 In general, for $m > 1$ teachers, the 2 in Eq. 60 is replaced by $m/(m - 1)$.
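
The contrast between the two regimes can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this section and should be read as such: that $D = 1 + \sigma_a^2/(n\tau^2)$ (the usual form for a Gaussian prior and likelihood), that the only production noise is $\sigma_a^2$, and that the variance map has contraction coefficient $(n + m - 1)/(nmD^2)$. Under these assumptions the fixed point is $m\sigma_a^2/(nmD^2 - n - m + 1)$. The function names and the values of $\tau$ and $n$ are hypothetical.

```python
def stable_variance(n, m, sigma_a, tau):
    """Fixed point of the assumed variance map for m teachers."""
    D = 1.0 + sigma_a ** 2 / (n * tau ** 2)  # assumed form of D
    return m * sigma_a ** 2 / (n * m * D ** 2 - n - m + 1)


def leading_order(n, m, sigma_a, tau):
    """Large-n approximation: prior-dominated for m = 1, data-dominated for m > 1."""
    if m == 1:
        return tau ** 2 / 2.0 - sigma_a ** 2 / (4.0 * n)
    return (m / (m - 1.0)) * sigma_a ** 2 / n


if __name__ == "__main__":
    sigma_a, tau = 50.0, 20.0  # tau is an illustrative choice
    for n in (100, 10_000, 1_000_000):
        for m in (1, 2):
            print(n, m, stable_variance(n, m, sigma_a, tau),
                  leading_order(n, m, sigma_a, tau))
```

Under these assumptions the single-teacher variance approaches $\tau^2/2$ as $n$ grows, while the two-teacher variance shrinks roughly like $2\sigma_a^2/n$.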


Appendix D Complex prior models 

Here we provide a more detailed description of the complex prior models, and technical 
details of the complex prior model simulations whose results are given in Complex prior 
models in the main paper. 

In these models, we again assume (as in the simple prior models) that learners estimate
the mean of the contextual variant based on the assumption that data are generated i.i.d.
from a Gaussian source with a fixed $c$, and that they have a prior over how likely different
values of $c$ are, which is now given by:

$$f_{C^{t+1}}(c) \propto \left[a(\mu_a - \mu_i)^2 + \left(c - (\mu_a + \mu_i)/2\right)^2\right] \tag{62}$$

(We write $\propto$ instead of $=$ because $f_{C^{t+1}}(c)$ must be scaled by some constant to be a
probability distribution.) The strength of this prior is controlled by $a$: as $a \to 0$, values
of $c$ near $\mu_a$ and $\mu_i$ are maximally preferred relative to values in between (Fig. 4A in the
main paper).

This prior is updated to a posterior distribution based on the data $y$. The log of the
posterior is given by:

$$\log\left(f_{C^{t+1}}(c \mid y)\right) = -\sum_{i=1}^{n} \frac{(y_i - c)^2}{2\sigma_a^2} + \log\left[a(\mu_a - \mu_i)^2 + \left(c - (\mu_a + \mu_i)/2\right)^2\right] + \text{constant} \tag{63}$$

where the constant is a term which does not depend on $c$.

We assume that each learner takes the MAP estimate of $c$ over the interval $[\mu_i, \mu_a]$
based on this posterior. Because this MAP estimate is not in general possible to compute
analytically, it is also not possible to obtain analytical expressions for the evolution of
the mean and variance of $C^t$, as in the naive learner and simple prior models. Thus, we
proceeded by simulation to examine the evolution of $C^t$.
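
To make the MAP step concrete, the following Python sketch evaluates the log posterior of Eq. 63 (dropping the additive constant) and maximizes it over $[\mu_i, \mu_a]$ with a plain golden-section search. This is a simplified stand-in written for illustration, not the routine used for the simulations; the function names are hypothetical, and the Gaussian log-likelihood term is assumed to have the standard form $-(y_i - c)^2/(2\sigma_a^2)$.

```python
import math


def log_posterior(c, ys, a, mu_a, mu_i, sigma_a):
    """Log posterior of Eq. 63 up to an additive constant (assumed form)."""
    loglik = -sum((y - c) ** 2 for y in ys) / (2.0 * sigma_a ** 2)
    mid = (mu_a + mu_i) / 2.0
    logprior = math.log(a * (mu_a - mu_i) ** 2 + (c - mid) ** 2)
    return loglik + logprior


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - invphi * (hi - lo)
    x2 = lo + invphi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]: drop the left segment.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + invphi * (hi - lo)
            f2 = f(x2)
        else:
            # The maximum lies in [lo, x2]: drop the right segment.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - invphi * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2.0


def map_estimate(ys, a, mu_a, mu_i, sigma_a):
    """MAP estimate of c restricted to the interval [mu_i, mu_a]."""
    return golden_section_max(
        lambda c: log_posterior(c, ys, a, mu_a, mu_i, sigma_a), mu_i, mu_a)


if __name__ == "__main__":
    ys = [700.0] * 100  # artificial data: every example equals 700
    print(map_estimate(ys, a=0.05, mu_a=730.0, mu_i=530.0, sigma_a=50.0))
```

Golden-section search only guarantees the global maximum when the objective is unimodal on the interval, so for small $a$ the result may be a local maximum.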

D.1 Simulations: setup

As an approximation to the deterministic evolution which would result for $M = \infty$, we
carried out simulations using $M = 50000$ for the single-teacher setting and $M = 2500$
for the multiple-teacher settings. These values were large enough to give behavior very
close to deterministic for the multiple-teacher settings, and roughly deterministic behavior
in the single-teacher setting. In the single-teacher setting, it was not possible to obtain
effectively deterministic behavior for any feasible value of $M$. This should be kept in
mind when examining the results of the single-teacher simulations, where there is a small
stochastic component to the results (relative to $M = \infty$), compared to the multiple-teacher
simulations, where the results approximate the $M = \infty$ case very closely.

In each simulation run, all parameters except $a$, $\lambda$, and $m$ (the number of teachers)
were set to the same values: $\mu_a = 730$, $\mu_i = 530$, $\sigma_a = 50$, $n = 100$. Runs were conducted
for values of $a \in [0.001, 0.05]$ and $\lambda \in [0, 2.0]$, for the single-teacher, two-teacher, and
$M$-teacher cases ($m = 1, 2, M$). Each run began by assigning a value of $c$ to the $M$
learners in generation 1, drawn according to a starting distribution. Because we are
primarily interested in the evolution of a population which begins with the contextual
variant uniformly pronounced similarly to $V_1$, we always used $C^1 \sim N(\mu_a - 10, 10)$ as
the starting distribution. For each of the $M$ learners in generation $t$, where $t \geq 2$, $m$
teachers were drawn at random from generation $t - 1$, and used to generate $n = 100$
examples (with the teacher for each example chosen randomly from the $m$ teachers, with
replacement). The learner’s MAP estimate $\hat{c}$ for this data was found by maximizing Eq.
63 over values of $c \in [\mu_i, \mu_a]$, using the unidimensional optimize() function in R, which
uses “a combination of golden section search and successive parabolic interpolation”
(R Core Team 2014).

For the two- and $M$-teacher cases, simulations were run until $t = 2500$, at which point
the distribution of $C^t$ had always reached a stable state (by visual inspection). For the
single-teacher case, which converged much more slowly, simulations were run until the
mean, 5th percentile, and 95th percentile of the distribution of $C^t$ had (each) not changed
by more than 2 in 500 generations. If this criterion was not met by $t = 10000$, the
simulation was stopped. At this point the distribution of $C^t$ had reached a stable state
for runs corresponding to the dark red and dark blue regions of Panel 1 of the corresponding
figure in the main paper, though not necessarily for runs corresponding to the region in between.
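
The simulation procedure described above can be sketched in a few lines of Python. The sketch is deliberately small and makes several assumptions that are not fixed by this section: each example is drawn from a Gaussian with mean $c - \lambda$ and standard deviation $\sigma_a$, the second argument of the starting distribution is treated as a standard deviation, teachers are sampled with replacement, and the MAP estimate is found by grid search rather than by R's optimize(). All function names and the small values of $M$ and the number of generations are hypothetical.

```python
import math
import random


def grid_map_estimate(ys, a, mu_a, mu_i, sigma_a, step=0.5):
    """Grid-search MAP estimate of c on [mu_i, mu_a] under Eq. 63."""
    n = len(ys)
    y_bar = sum(ys) / n
    mid = (mu_a + mu_i) / 2.0
    best_c, best_val = mu_i, -math.inf
    for k in range(int(round((mu_a - mu_i) / step)) + 1):
        c = mu_i + k * step
        # Up to a constant in c, sum((y - c)^2) equals n * (c - y_bar)^2.
        val = (-n * (c - y_bar) ** 2 / (2.0 * sigma_a ** 2)
               + math.log(a * (mu_a - mu_i) ** 2 + (c - mid) ** 2))
        if val > best_val:
            best_c, best_val = c, val
    return best_c


def next_generation(teachers, m, n, a, lam, mu_a, mu_i, sigma_a, rng):
    """Produce one new generation of the same size as the teacher generation."""
    learners = []
    for _ in range(len(teachers)):
        chosen = [rng.choice(teachers) for _ in range(m)]  # m random teachers
        # Each example comes from a randomly chosen one of the m teachers;
        # the production model (mean c - lam) is an assumption of this sketch.
        ys = [rng.gauss(rng.choice(chosen) - lam, sigma_a) for _ in range(n)]
        learners.append(grid_map_estimate(ys, a, mu_a, mu_i, sigma_a))
    return learners


def simulate(M, generations, m, n, a, lam,
             mu_a=730.0, mu_i=530.0, sigma_a=50.0, seed=0):
    """Return a list of generations, each a list of M values of c."""
    rng = random.Random(seed)
    population = [rng.gauss(mu_a - 10.0, 10.0) for _ in range(M)]
    history = [population]
    for _ in range(generations - 1):
        population = next_generation(
            population, m, n, a, lam, mu_a, mu_i, sigma_a, rng)
        history.append(population)
    return history


if __name__ == "__main__":
    history = simulate(M=40, generations=5, m=2, n=100, a=0.05, lam=0.5)
    for t, generation in enumerate(history, start=1):
        print(t, sum(generation) / len(generation))
```

A population this small behaves much more stochastically than the values of $M$ reported above, so the sketch illustrates the procedure rather than the reported results.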


15 optimize is guaranteed to find the global maximum only if Eq. 63 is unimodal over the interval $c \in [\mu_i, \mu_a]$;
otherwise, it is only guaranteed to find a local maximum. Whether Eq. 63 is unimodal over this interval in
general depends on the values of the data ($y$) and the system parameters ($a$, $\sigma_a$, etc.). Eq. 63 can be shown to
be concave on $[\mu_i, \mu_a]$ for any $y$ if $a$ exceeds a lower bound, which is the case for almost all simulation
runs considered here (those with $a > 0.0025$). Note that concavity is not a necessary condition for optimize
to find the global optimum, which it seems to nearly always do anyway in our setting. We satisfied ourselves
that optimize getting stuck in local maxima was not a problem by comparing the results with those obtained
by using grid search instead, for a subset of the runs.